
You Don’t Have to Think Twice if You Carefully Tokenize

Internal identifier: 000170 (Main/Exploration); previous: 000169; next: 000171

Authors: Stefan Klatt [Germany]; Bernd Bohnet [Germany]

Source :

Lecture Notes in Computer Science [ISSN 0302-9743]; 2005.

RBID : ISTEX:6800DF6D171E421B4E2D105EF08CFA86A02E8475

Abstract

Abstract: Most tokenizers in current use only segment a text into tokens and combine them into sentences. This is not how we think a tokenizer should work. We believe that a tokenizer should support the subsequent analysis components as well as it can. We present a tokenizer with a strong focus on transparency. First, the tokenizer's decisions are encoded in such a way that the original text can be reconstructed. This supports the identification of typical errors and, as a consequence, faster creation of better tokenizer versions. Second, all detected information that might be relevant for subsequent analysis components is made transparent through XML tags and special information codes for each token. Third, doubtful decisions are also marked with XML tags. This is helpful for offline applications such as corpus building, where it seems more appropriate to spend a few minutes checking doubtful decisions manually than to work with incorrect data for years.
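The three properties named in the abstract (reconstructable output, per-token information in XML tags, and marked doubtful decisions) can be illustrated with a small sketch. This is not the authors' tokenizer: the abbreviation list, the element names `s` and `tok`, and the attributes `sp` and `doubt` are all hypothetical choices made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical abbreviation list: a period after one of these words may mark
# the abbreviation, the end of a sentence, or both, so the decision is doubtful.
ABBREVIATIONS = {"Dr", "Prof", "etc"}

# A token is a run of word characters or a single punctuation character.
# The whitespace that follows it is captured too, so no input is discarded.
TOKEN_RE = re.compile(r"(\w+|[^\w\s])(\s*)")


def tokenize(text):
    """Return (leading_whitespace, tokens); each token keeps its trailing space."""
    leading = re.match(r"\s*", text).group(0)
    tokens = []
    for match in TOKEN_RE.finditer(text, len(leading)):
        form, space = match.groups()
        doubtful = (
            form == "." and bool(tokens) and tokens[-1]["form"] in ABBREVIATIONS
        )
        tokens.append({"form": form, "space": space, "doubtful": doubtful})
    return leading, tokens


def reconstruct(leading, tokens):
    """Rebuild the original text from the token list."""
    return leading + "".join(t["form"] + t["space"] for t in tokens)


def to_xml(tokens):
    """Serialize tokens with hypothetical <tok> elements; flag doubtful ones."""
    root = ET.Element("s")
    for t in tokens:
        tok = ET.SubElement(root, "tok", {"sp": t["space"]})
        if t["doubtful"]:
            tok.set("doubt", "yes")
        tok.text = t["form"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    leading, tokens = tokenize("Dr. Smith arrived. He left.")
    print(to_xml(tokens))
    print(reconstruct(leading, tokens))
```

Because every token stores the whitespace that followed it, `reconstruct` returns the input unchanged, and a reviewer could search the XML for `doubt="yes"` to check only the uncertain decisions by hand.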

Url:

https://api.istex.fr/document/6800DF6D171E421B4E2D105EF08CFA86A02E8475/fulltext/pdf

DOI: 10.1007/978-3-540-30211-7_32

Affiliations:

Links to previous steps (curation, corpus, ...)

to stream Istex, to step Corpus: 000363
to stream Istex, to step Curation: 000363
to stream Istex, to step Checkpoint: 000130
to stream Main, to step Merge: 000185
to stream Main, to step Curation: 000170

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">You Don’t Have to Think Twice if You Carefully Tokenize</title>
<author><name sortKey="Klatt, Stefan" sort="Klatt, Stefan" uniqKey="Klatt S" first="Stefan" last="Klatt">Stefan Klatt</name>
</author>
<author><name sortKey="Bohnet, Bernd" sort="Bohnet, Bernd" uniqKey="Bohnet B" first="Bernd" last="Bohnet">Bernd Bohnet</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:6800DF6D171E421B4E2D105EF08CFA86A02E8475</idno>
<date when="2005" year="2005">2005</date>
<idno type="doi">10.1007/978-3-540-30211-7_32</idno>
<idno type="url">https://api.istex.fr/document/6800DF6D171E421B4E2D105EF08CFA86A02E8475/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000363</idno>
<idno type="wicri:Area/Istex/Curation">000363</idno>
<idno type="wicri:Area/Istex/Checkpoint">000130</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000130</idno>
<idno type="wicri:doubleKey">0302-9743:2005:Klatt S:you:don:t</idno>
<idno type="wicri:Area/Main/Merge">000185</idno>
<idno type="wicri:Area/Main/Curation">000170</idno>
<idno type="wicri:Area/Main/Exploration">000170</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">You Don’t Have to Think Twice if You Carefully Tokenize</title>
<author><name sortKey="Klatt, Stefan" sort="Klatt, Stefan" uniqKey="Klatt S" first="Stefan" last="Klatt">Stefan Klatt</name>
<affiliation wicri:level="3"><country>Allemagne</country>
<placeName><settlement type="city">Stuttgart</settlement>
<region type="land" nuts="1">Bade-Wurtemberg</region>
<region type="district" nuts="2">District de Stuttgart</region>
</placeName>
<wicri:orgArea>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569</wicri:orgArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
<author><name sortKey="Bohnet, Bernd" sort="Bohnet, Bernd" uniqKey="Bohnet B" first="Bernd" last="Bohnet">Bernd Bohnet</name>
<affiliation wicri:level="3"><country>Allemagne</country>
<placeName><settlement type="city">Stuttgart</settlement>
<region type="land" nuts="1">Bade-Wurtemberg</region>
<region type="district" nuts="2">District de Stuttgart</region>
</placeName>
<wicri:orgArea>Institute for Intelligent Systems, University of Stuttgart, Universitätsstr. 38, 70569</wicri:orgArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2005</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">6800DF6D171E421B4E2D105EF08CFA86A02E8475</idno>
<idno type="DOI">10.1007/978-3-540-30211-7_32</idno>
<idno type="ChapterID">32</idno>
<idno type="ChapterID">Chap32</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Most of the currently used tokenizers only segment a text into tokens and combine them to sentences. But this is not the way, we think a tokenizer should work. We believe that a tokenizer should support the following analysis components in the best way it can. We present a tokenizer with a high focus on transparency. First, the tokenizer decisions are encoded in such a way that the original text can be reconstructed. This supports the identification of typical errors and – as a consequence – a faster creation of better tokenizer versions. Second, all detected relevant information that might be important for subsequent analysis components are made transparent by XML-tags and special information codes for each token. Third, doubtful decisions are also marked by XML-tags. This is helpful for off-line applications like corpora building, where it seems to be more appropriate to check doubtful decisions in a few minutes manually than working with incorrect data over years.</div>
</front>
</TEI>
<affiliations><list><country><li>Allemagne</li>
</country>
<region><li>Bade-Wurtemberg</li>
<li>District de Stuttgart</li>
</region>
<settlement><li>Stuttgart</li>
</settlement>
</list>
<tree><country name="Allemagne"><region name="Bade-Wurtemberg"><name sortKey="Klatt, Stefan" sort="Klatt, Stefan" uniqKey="Klatt S" first="Stefan" last="Klatt">Stefan Klatt</name>
</region>
<name sortKey="Bohnet, Bernd" sort="Bohnet, Bernd" uniqKey="Bohnet B" first="Bernd" last="Bohnet">Bernd Bohnet</name>
<name sortKey="Bohnet, Bernd" sort="Bohnet, Bernd" uniqKey="Bohnet B" first="Bernd" last="Bohnet">Bernd Bohnet</name>
<name sortKey="Klatt, Stefan" sort="Klatt, Stefan" uniqKey="Klatt S" first="Stefan" last="Klatt">Stefan Klatt</name>
</country>
</tree>
</affiliations>
</record>
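As a rough illustration of how such a record could be read programmatically, the sketch below extracts the title, authors, and DOI with Python's standard library. Two assumptions are made: the record text is abridged here, and because the record as shown uses the `wicri:` prefix without declaring it, the sketch wraps the text in a hypothetical `wrapper` element that binds the prefix to a placeholder namespace URI (`urn:example:wicri`), which is not the real Wicri namespace.

```python
import xml.etree.ElementTree as ET

# Abridged excerpt of the record (illustration only; most attributes omitted).
RECORD = """<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">You Don’t Have to Think Twice if You Carefully Tokenize</title>
<author><name sortKey="Klatt, Stefan">Stefan Klatt</name></author>
<author><name sortKey="Bohnet, Bernd">Bernd Bohnet</name></author>
</titleStmt>
<publicationStmt><idno type="doi">10.1007/978-3-540-30211-7_32</idno>
</publicationStmt>
</fileDesc></teiHeader></TEI></record>"""

# Placeholder URI (an assumption): it only makes the `wicri:` prefix parseable.
WICRI_NS = "urn:example:wicri"


def parse_record(record_text):
    """Return title, authors, and DOI from a record like the one above."""
    # Wrap the record so the undeclared `wicri:` prefix is bound to a namespace.
    wrapped = f'<wrapper xmlns:wicri="{WICRI_NS}">{record_text}</wrapper>'
    root = ET.fromstring(wrapped)
    title_stmt = root.find(".//titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        "doi": root.findtext(".//idno[@type='doi']"),
    }


if __name__ == "__main__":
    print(parse_record(RECORD))
```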

To manipulate this document under Unix (Dilib)

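# Directory of the Main/Exploration step for this area
EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000170 | SxmlIndent | more

# Same command, assuming EXPLOR_AREA is already set to the area root (.../explor/TeiVM2)
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000170 | SxmlIndent | more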

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Ticri
   |area=    TeiVM2
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:6800DF6D171E421B4E2D105EF08CFA86A02E8475
   |texte=   You Don’t Have to Think Twice if You Carefully Tokenize
}}

This area was generated with Dilib version V0.6.31.
Data generation: Mon Oct 30 21:59:18 2017. Site generation: Sun Feb 11 23:16:06 2024

	TEI exploration server
	Warning: this site is under development! It was generated automatically from raw corpora, so the information has not been validated.
